For this exercise, I will participate in the TidyTuesday data exploration and analysis. The data is gathered from the TidyTuesday GitHub repository and loaded as a package. I will wrangle and explore the data, then conduct model fitting, and finally test the data.
Set Up
First, I will start by loading all necessary packages and the data.
Load Packages
library(broom)
library(ggplot2)
library(here)
here() starts at /Users/mutsa_n/Desktop/MADA-course/mutsanyamuranga-MADA-portfolio
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Next, I will conduct some exploratory data analysis using graphs, plots, and tables. To start, we have to understand what the data is and what it is tracking.
Data Structure
# Look at full summary of each data set
summary(eclipse_annular_2023)
    state               name                lat             lon
 Length:811         Length:811         Min.   :27.22   Min.   :-124.45
 Class :character   Class :character   1st Qu.:31.30   1st Qu.:-111.98
 Mode  :character   Mode  :character   Median :35.42   Median :-106.70
                                       Mean   :35.41   Mean   :-108.05
                                       3rd Qu.:38.42   3rd Qu.:-101.36
                                       Max.   :44.87   Max.   : -96.72
  eclipse_1         eclipse_2         eclipse_3         eclipse_4
 Length:811        Length:811        Length:811        Length:811
 Class1:hms        Class1:hms        Class1:hms        Class1:hms
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric
  eclipse_5         eclipse_6
 Length:811        Length:811
 Class1:hms        Class1:hms
 Class2:difftime   Class2:difftime
 Mode  :numeric    Mode  :numeric
summary(eclipse_partial_2023)
    state               name                lat             lon
 Length:31363       Length:31363       Min.   :17.96   Min.   :-176.60
 Class :character   Class :character   1st Qu.:35.36   1st Qu.: -97.50
 Mode  :character   Mode  :character   Median :39.56   Median : -89.26
                                       Mean   :38.80   Mean   : -91.97
                                       3rd Qu.:41.93   3rd Qu.: -81.14
                                       Max.   :71.25   Max.   : 174.11
  eclipse_1         eclipse_2         eclipse_3         eclipse_4
 Length:31363      Length:31363      Length:31363      Length:31363
 Class1:hms        Class1:hms        Class1:hms        Class1:hms
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric
  eclipse_5
 Length:31363
 Class1:hms
 Class2:difftime
 Mode  :numeric
summary(eclipse_partial_2024)
    state               name                lat             lon
 Length:28844       Length:28844       Min.   :17.96   Min.   :-176.60
 Class :character   Class :character   1st Qu.:35.24   1st Qu.: -99.08
 Mode  :character   Mode  :character   Median :39.52   Median : -90.30
                                       Mean   :38.76   Mean   : -93.00
                                       3rd Qu.:42.04   3rd Qu.: -81.16
                                       Max.   :71.25   Max.   : 174.11
  eclipse_1         eclipse_2         eclipse_3         eclipse_4
 Length:28844      Length:28844      Length:28844      Length:28844
 Class1:hms        Class1:hms        Class1:hms        Class1:hms
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric
  eclipse_5
 Length:28844
 Class1:hms
 Class2:difftime
 Mode  :numeric
summary(eclipse_total_2024)
    state               name                lat             lon
 Length:3330        Length:3330        Min.   :28.45   Min.   :-101.16
 Class :character   Class :character   1st Qu.:35.42   1st Qu.: -92.41
 Mode  :character   Mode  :character   Median :39.24   Median : -86.56
                                       Mean   :38.33   Mean   : -86.93
                                       3rd Qu.:41.22   3rd Qu.: -82.31
                                       Max.   :46.91   Max.   : -67.43
  eclipse_1         eclipse_2         eclipse_3         eclipse_4
 Length:3330       Length:3330       Length:3330       Length:3330
 Class1:hms        Class1:hms        Class1:hms        Class1:hms
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric
  eclipse_5         eclipse_6
 Length:3330       Length:3330
 Class1:hms        Class1:hms
 Class2:difftime   Class2:difftime
 Mode  :numeric    Mode  :numeric
# Look at names of variables
names(eclipse_annular_2023)
# Look at number of rows and columns in each data set
nrow(eclipse_annular_2023)
[1] 811
nrow(eclipse_partial_2023)
[1] 31363
nrow(eclipse_partial_2024)
[1] 28844
nrow(eclipse_total_2024)
[1] 3330
ncol(eclipse_annular_2023)
[1] 10
ncol(eclipse_partial_2023)
[1] 9
ncol(eclipse_partial_2024)
[1] 9
ncol(eclipse_total_2024)
[1] 10
At first glance, there are four data sets: two with data from the annular eclipse in 2023 and two with data from the total eclipse in 2024. For each eclipse, one data set covers locations that saw the annular or total phase and the other covers locations that saw only a partial eclipse. Each row records a location (state, city name, latitude, and longitude) and the times of day at which the moon contacts the sun at successive points of the eclipse. These times are stored in five eclipse variables in the partial data sets and six in the annular and total data sets. For example, in the annular and total data sets, eclipse_3 is the time at which annularity begins at the location in 2023 and the time at which totality begins at the location in 2024.
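The difference between the nine- and ten-column layouts reported by ncol() can be checked directly by comparing column names. The following is a minimal sketch in base R. The data frames toy_total and toy_partial are hypothetical stand-ins that mimic only the column names shown in the summaries, not the real values.

```r
# Toy stand-ins that mimic only the column names of the real data sets
# (the values are placeholders, not real eclipse data).
toy_total <- data.frame(
  state = "XX", name = "Example", lat = 0, lon = 0,
  eclipse_1 = NA, eclipse_2 = NA, eclipse_3 = NA,
  eclipse_4 = NA, eclipse_5 = NA, eclipse_6 = NA
)
toy_partial <- toy_total[, names(toy_total) != "eclipse_6"]

# Columns present in the total layout but absent from the partial layout
extra_cols <- setdiff(names(toy_total), names(toy_partial))
print(extra_cols)
```

On the real data, the equivalent call would be setdiff(names(eclipse_total_2024), names(eclipse_partial_2024)).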
Feature Engineering
I added a column for duration of visibility in minutes for all solar eclipses from first to last contact.
# Duration of the total eclipse of 2024
eclipse_total_2024 <- eclipse_total_2024 %>%
  mutate(eclipse_1_time = hms(eclipse_1),
         eclipse_6_time = hms(eclipse_6),
         duration = as.numeric(eclipse_6_time - eclipse_1_time) / 60)
# Duration of the annular eclipse of 2023
eclipse_annular_2023 <- eclipse_annular_2023 %>%
  mutate(eclipse_1_time = hms(eclipse_1),
         eclipse_6_time = hms(eclipse_6),
         duration = as.numeric(eclipse_6_time - eclipse_1_time) / 60)
# Duration of the partial eclipse of 2024
eclipse_partial_2024 <- eclipse_partial_2024 %>%
  mutate(eclipse_1_time = hms(eclipse_1),
         eclipse_5_time = hms(eclipse_5),
         duration = as.numeric(eclipse_5_time - eclipse_1_time) / 60)
# Duration of the partial eclipse of 2023
eclipse_partial_2023 <- eclipse_partial_2023 %>%
  mutate(eclipse_1_time = hms(eclipse_1),
         eclipse_5_time = hms(eclipse_5),
         duration = as.numeric(eclipse_5_time - eclipse_1_time) / 60)
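The duration calculation amounts to converting both contact times to seconds, subtracting, and dividing by 60. To make the arithmetic concrete, here is a base-R sketch with made-up contact times. The helper to_seconds() and the two time strings are hypothetical and do not come from the data.

```r
# Convert "HH:MM:SS" strings to seconds since midnight (base R only).
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Hypothetical contact times for one location (illustrative, not from the data)
first_contact <- "15:10:50"
last_contact  <- "17:58:20"

# Duration in minutes from first to last contact
duration_min <- (to_seconds(last_contact) - to_seconds(first_contact)) / 60
print(duration_min)  # 167.5
```

Here the first contact is 54,650 seconds after midnight and the last is 64,700, so the difference of 10,050 seconds corresponds to 167.5 minutes.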
I will add an eclipse_year column to each data set so that every observation can be traced to the year of its eclipse, and an eclipse_type column for plotting.
# Identifier for the Total Eclipse 2024 data
eclipse_total_2024 <- mutate(eclipse_total_2024, eclipse_type = 'Total_2024', eclipse_year = '2024')
# Identifier for the Annular Eclipse 2023 data
eclipse_annular_2023 <- mutate(eclipse_annular_2023, eclipse_type = 'Annular_2023', eclipse_year = '2023')
# Identifier for the Partial Eclipse 2024 data
eclipse_partial_2024 <- mutate(eclipse_partial_2024, eclipse_type = 'Partial_2024', eclipse_year = '2024')
# Identifier for the Partial Eclipse 2023 data
eclipse_partial_2023 <- mutate(eclipse_partial_2023, eclipse_type = 'Partial_2023', eclipse_year = '2023')
Now, I will merge the four data sets into one by stacking their rows, keeping state, city name, latitude, longitude, duration, eclipse year, and eclipse type in the final data. I will also convert eclipse_year to a factor variable.
# Combine all the data sets by row
eclipse_all <- bind_rows(eclipse_total_2024, eclipse_annular_2023,
                         eclipse_partial_2024, eclipse_partial_2023) %>%
  # Select relevant columns
  select(state, name, lat, lon, duration, eclipse_year, eclipse_type) %>%
  # Convert eclipse_year to a factor
  mutate(eclipse_year = factor(eclipse_year))
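Because the data sets do not all share the same columns, it helps to see how bind_rows() treats the mismatch: it matches columns by name and fills the missing ones with NA. The sketch below uses two small hypothetical tibbles (toy_total and toy_partial, with made-up values) rather than the real data.

```r
library(dplyr)

# Toy data frames with different column sets (hypothetical values)
toy_total   <- tibble(name = "A", duration = 160, eclipse_6 = "17:58:20")
toy_partial <- tibble(name = c("B", "C"), duration = c(150, 140))

# bind_rows() matches columns by name and fills missing ones with NA
toy_stacked <- bind_rows(toy_total, toy_partial)
print(toy_stacked)

# Keeping only the shared columns removes the NA-padded column
toy_kept <- select(toy_stacked, name, duration)
print(toy_kept)
```

Selecting only the shared columns after stacking, as done for eclipse_all, leaves no NA-padded columns in the final data.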
Visualization
I will look further into how the variables in this data are associated using graphs.
Scatter Plot
# Scatter plot of eclipse duration by longitude
ggplot(eclipse_all, aes(x = lon, y = duration, color = eclipse_type)) +
  geom_point() +
  labs(title = "Eclipse Duration by Longitude",
       x = "Longitude",
       y = "Duration (minutes)") +
  scale_color_manual(values = c("Total_2024" = "blue", "Annular_2023" = "green",
                                "Partial_2024" = "orange", "Partial_2023" = "red")) +
  theme_minimal()
Map
# Load leaflet for the map functions below (needed if it is not already attached)
library(leaflet)
# Define color palette for each eclipse type
eclipse_colors <- c("Partial_2024" = "orange", "Partial_2023" = "red",
                    "Annular_2023" = "green", "Total_2024" = "blue")
# Create a leaflet map
map1 <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap tiles as the base layer
  setView(lng = -95.7129, lat = 37.0902, zoom = 2)  # Set initial view to focus on the world
# Add eclipse locations as markers to the map
map1 <- map1 %>%
  addCircleMarkers(data = eclipse_all, lng = ~lon, lat = ~lat,
                   radius = 5, color = ~eclipse_colors[eclipse_type],
                   popup = ~paste("Location: ", name, "<br/>",
                                  "Eclipse Type: ", eclipse_type, "<br/>",
                                  "Duration (minutes): ", duration))
# Display the map
map1
Input to asJSON(keep_vec_names=TRUE) is a named vector. In a future version of jsonlite, this option will not be supported, and named vectors will be translated into arrays instead of objects. If you want JSON object output, please use a named list instead. See ?toJSON.
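The jsonlite message says that a named vector is being serialized. The most likely source is the color lookup, since indexing a named vector by name keeps the names on the result. One possible way to avoid the message, offered here as an assumption rather than as the author's fix, is to drop the names with unname() before the colors reach the map. A base-R sketch with hypothetical eclipse_type values:

```r
eclipse_colors <- c("Partial_2024" = "orange", "Partial_2023" = "red",
                    "Annular_2023" = "green", "Total_2024" = "blue")

# Hypothetical eclipse_type values for three locations
eclipse_type <- c("Total_2024", "Partial_2023", "Total_2024")

named_cols <- eclipse_colors[eclipse_type]  # result keeps the names
plain_cols <- unname(named_cols)            # same colors, names dropped

print(names(named_cols))
print(plain_cols)
```

In the map code this would correspond to color = ~unname(eclipse_colors[eclipse_type]). Whether that fully silences the message in a given leaflet version is untested here.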